
Collaborating Authors: Neural Magic


Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Agarwalla, Abhinav, Gupta, Abhay, Marques, Alexandre, Pandit, Shubhra, Goin, Michael, Kurtic, Eldar, Leong, Kevin, Nguyen, Tuan, Salem, Mahmoud, Alistarh, Dan, Lie, Sean, Kurtz, Mark

arXiv.org Artificial Intelligence

Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.
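The abstract above combines SparseGPT one-shot pruning with sparse pretraining. SparseGPT itself uses second-order (Hessian-based) weight reconstruction, which is beyond a short sketch; as a simplified, hypothetical illustration of the one-shot pruning idea it builds on, the snippet below applies plain magnitude pruning to reach a 70% sparsity target on a weight matrix. All names here are illustrative, not part of the paper's code.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    A simplified stand-in for one-shot pruning: SparseGPT uses
    second-order information rather than raw magnitudes.
    """
    flat = np.abs(weights).flatten()
    k = int(len(flat) * sparsity)          # number of weights to remove
    threshold = np.partition(flat, k)[k]   # k-th smallest magnitude
    mask = np.abs(weights) >= threshold    # keep only the larger weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
pruned, mask = magnitude_prune(w, 0.70)
print(f"sparsity: {1 - mask.mean():.2f}")  # ≈ 0.70
```

In a real sparse-pretraining setup, the mask would then be held fixed while the remaining weights are further trained to recover accuracy, which is the role the SlimPajama/Stack pretraining phase plays in the paper.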


SparseGPT: Remove 100B Parameters For Free - Neural Magic

#artificialintelligence

Large language models (LLMs) solve natural language processing problems with astounding accuracy. However, these models are enormous and require a lot of space, cost, and computation power to deploy. For example, the GPT-175B model has 175 billion parameters requiring 320GB of storage and at least 5 A100 GPUs with 80GB of memory each for inference. This compute is expensive, making such deployments viable only for well-resourced organizations and putting them out of reach for small organizations and individuals.
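The GPT-175B figures above can be sanity-checked with simple arithmetic, assuming fp16 weights (2 bytes per parameter) and counting weight memory only (activations and KV cache would add more):

```python
import math

params = 175e9                  # 175 billion parameters
bytes_per_param = 2             # fp16: 2 bytes per weight
total_bytes = params * bytes_per_param

total_gib = total_bytes / 1024**3            # weight storage in GiB
gpus_needed = math.ceil(total_gib / 80)      # A100 80GB cards, weights only

print(f"weights: {total_gib:.0f} GiB, min GPUs: {gpus_needed}")
# prints "weights: 326 GiB, min GPUs: 5"
```

The result (~326 GiB of weights, 5 GPUs) is consistent with the ~320GB and five-A100 figures quoted in the article.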


Machine Learning Engineer at Neural Magic - Somerville, Massachusetts, United States

#artificialintelligence

Neural Magic is an early-stage AI software company democratizing high performance for deep learning models. Our goal is to reduce the cost and increase the performance of end-users deploying deep learning applications. Based on decades of research at MIT, Neural Magic has developed a software platform that allows developers to sparsify deep learning models to minimize footprint and run on CPUs at GPU speeds. Please look through our website and GitHub repos to get a feel of what we are about. Founded by an award-winning team of computer scientists and researchers out of MIT, we are a venture-backed company headquartered in Davis Square, Somerville, MA.


Fast DistilBERT on CPUs

Shen, Haihao, Zafrir, Ofir, Dong, Bo, Meng, Hengyu, Ye, Xinyu, Wang, Zhe, Ding, Yi, Chang, Hanwen, Boudoukh, Guy, Wasserblat, Moshe

arXiv.org Artificial Intelligence

Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires maximum throughput under strict latency constraints, which often prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model showing minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and throughput results under typical production constraints and environments. Our results outperform the existing state-of-the-art Neural Magic DeepSparse runtime by up to 50% and achieve up to a 4.1x speedup over ONNX Runtime.
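Quantization, one ingredient of the pipeline described above, maps float weights to low-precision integers. As a minimal sketch (not the paper's actual method, which is hardware-aware and combined with pruning and distillation), here is symmetric per-tensor int8 quantization; all function names are illustrative.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ≈ q * scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs error: {err:.4f}")  # bounded by half the scale
```

Storing `q` instead of `w` cuts weight memory 4x versus fp32, and integer kernels (like those in the runtime the paper describes) can exploit the int8 representation directly.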


Classify Finance Tweets Faster Using Sparsity - Neural Magic

#artificialintelligence

The world of finance and stock trading has changed in recent years. As more retail investors enter the market, stories and social sentiment become more important. Think Tesla - one can argue that a lot of the company's value comes from successful social storytelling by its CEO Elon Musk. Social media has the power to turn a bull into a bear and a bear into a bull. Classifying finance tweets using NLP to understand social sentiment is increasingly important.


Sparse Transformers

#artificialintelligence

Originally published on Towards AI. If you want to analyze how fast 19 sparse BERT models perform inference, you'll only need a YAML file and 16GB of RAM to find out.


The NLP Cypher

#artificialintelligence

With the great engineering minds at Neural Magic, we're all actively attempting to solve a very difficult problem. How do we get these large models into production without blowing up our hardware or our wallet? We all want the same robust performance with our deep learning models. We want them to be accurate, as light as possible, and fast. So… how do we achieve this?


Neural Magic Announces $30 Million Series A Funding Led by NEA

#artificialintelligence

Neural Magic, the AI company building a software platform for deep learning inference, announced a $30 million Series A funding round led by existing investor NEA with participation from Andreessen Horowitz, Amdocs, Comcast Ventures, Pillar VC, and Ridgeline Ventures. This financing brings the company's total amount raised to $50 million. The new capital will be used to advance Neural Magic's leadership in pure software machine learning acceleration and to support the success of a growing community of developers. Born out of MIT, Neural Magic creates groundbreaking algorithms and tools that bring software, rather than specialized hardware, to the center stage in machine learning (ML) infrastructure. The company creates machine learning models that deliver GPU-class performance on commodity CPU hardware, creating a flexible world of AI delivered and executed purely in software.


Neural Magic, which offers software for growing edge AI market, gets $30 million boost

#artificialintelligence

Neural Magic, a company that offers software to allow deep learning to be deployed more easily in edge locations, today announced a $30 million Series A funding. The market for edge AI is exploding, as more companies deploy it in a variety of applications across industries -- including in areas like asset maintenance and monitoring, factory automation, and telehealth. The market is expected to be worth $1.83 billion by 2026, according to one report by Markets and Markets. But increasingly, custom accelerator chips made by companies like Google and Nvidia to do this "inference" on the edge are unable to keep up with the improvements in efficiency, speed, and cost offered by software approaches like the one pushed by Neural Magic.


The startup making deep learning possible without specialized hardware

MIT Technology Review

GPUs became the hardware of choice for deep learning largely by coincidence. The chips were initially designed to quickly render graphics in applications such as video games. Unlike CPUs, which have four to eight complex cores for doing a variety of computation, GPUs have hundreds of simple cores that can perform only specific operations--but the cores can tackle their operations at the same time rather than one after another, shrinking the time it takes to complete an intensive computation. It didn't take long for the AI research community to realize that this massive parallelization also makes GPUs great for deep learning. Like graphics-rendering, deep learning involves simple mathematical calculations performed hundreds of thousands of times.